[SOUND] This lecture is about
natural language content analysis.
As you see from this picture,
this is really the first step
to process any text data.
Text data are in natural languages.
So, computers have to understand
natural languages to some extent in
order to make use of the data, so
that's the topic of this lecture.
We're going to cover three things.
First, what is natural language
processing, which is a main technique for
processing natural language
to obtain understanding?
The second is the State of the Art in NLP,
which stands for
natural language processing.
Finally, we're going to cover the relation
between natural language processing and
text retrieval.
First, what is NLP?
Well, the best way to
explain it is to think about,
if you see a text in a foreign
language that you can't understand.
Now, what do you have to do in
order to understand that text?
This is basically what
computers are facing.
Right?
So, looking at a simple sentence like,
a dog is chasing a boy on the playground.
We don't have any problems
understanding this sentence, but
imagine what the computer would have
to do in order to understand it.
In general,
it would have to do the following.
First, it would have to know dog is
a noun, chasing is a verb, et cetera.
So, this is called lexical analysis, or
part-of-speech tagging.
And, we need to figure out
the syntactic categories of those words.
So, that's a first step.
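To make this first step concrete, here is a minimal sketch of dictionary-based tagging in Python. The lexicon, the tag names, and the function `tag` are all made up for this illustration; a real tagger also has to handle unknown words and words with more than one possible tag.

```python
# Hypothetical toy lexicon: each word maps to exactly one part-of-speech tag.
# Real words are often ambiguous, which this sketch deliberately ignores.
TOY_LEXICON = {
    "a": "Det", "the": "Det",
    "dog": "Noun", "boy": "Noun", "playground": "Noun",
    "is": "Aux", "chasing": "Verb",
    "on": "Prep",
}

def tag(sentence):
    """Return a list of (word, tag) pairs; unknown words get the tag 'UNK'."""
    words = sentence.lower().rstrip(".").split()
    return [(word, TOY_LEXICON.get(word, "UNK")) for word in words]

tagged = tag("A dog is chasing a boy on the playground.")
for word, pos in tagged:
    print(word, pos)
```

A plain lookup like this only works because every word in the example has a single tag in the toy lexicon.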
After that, we're going to figure
out the structure of the sentence.
So for example, here it shows that a and
dog would go together
to form a noun phrase.
And, we won't have dog and
is go together first, right.
And, there are some structures
that are just not right.
But, this structure shows what we might
get if we look at the sentence and
try to interpret the sentence.
Some words would go together first, and
then they will go together
with other words.
So here, we show we have noun phrases
as intermediate components and
then verb phrases.
Finally, we have a sentence.
And, to get this structure, we need to
do something called syntactic analysis,
or parsing.
And, we may have a parser,
a computer program that would
automatically create this structure.
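As a rough illustration of what such a parser might do, here is a tiny hand-written parser in Python for the example sentence. The grammar (S -> NP VP, NP -> Det Noun, VP -> Aux Verb NP [PP], PP -> Prep NP), the tag names, and the function names are assumptions made for this sketch; it also simply attaches the prepositional phrase to the verb phrase, which is only one possible choice.

```python
# Input: the example sentence with (assumed) part-of-speech tags already assigned.
TAGGED = [
    ("a", "Det"), ("dog", "Noun"), ("is", "Aux"), ("chasing", "Verb"),
    ("a", "Det"), ("boy", "Noun"), ("on", "Prep"),
    ("the", "Det"), ("playground", "Noun"),
]

def parse_np(tokens, i):
    """NP -> Det Noun. Return (tree, index of the next unread token)."""
    (w1, t1), (w2, t2) = tokens[i], tokens[i + 1]
    if (t1, t2) != ("Det", "Noun"):
        raise ValueError(f"expected a noun phrase at position {i}")
    return ("NP", w1, w2), i + 2

def parse_pp(tokens, i):
    """PP -> Prep NP."""
    word, pos = tokens[i]
    if pos != "Prep":
        raise ValueError(f"expected a preposition at position {i}")
    np, j = parse_np(tokens, i + 1)
    return ("PP", word, np), j

def parse_vp(tokens, i):
    """VP -> Aux Verb NP [PP]; the trailing PP is optional."""
    (aux, t1), (verb, t2) = tokens[i], tokens[i + 1]
    if (t1, t2) != ("Aux", "Verb"):
        raise ValueError(f"expected a verb group at position {i}")
    obj, j = parse_np(tokens, i + 2)
    children = [aux, verb, obj]
    if j < len(tokens) and tokens[j][1] == "Prep":
        pp, j = parse_pp(tokens, j)
        children.append(pp)
    return ("VP", *children), j

def parse_sentence(tokens):
    """S -> NP VP. Raise ValueError if some words are left over."""
    subject, i = parse_np(tokens, 0)
    vp, i = parse_vp(tokens, i)
    if i != len(tokens):
        raise ValueError("unparsed words remain")
    return ("S", subject, vp)

tree = parse_sentence(TAGGED)
print(tree)
```

The nested tuples mirror the picture: words first group into noun phrases, those group into a verb phrase, and finally everything forms a sentence.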
At this point, you would know
the structure of this sentence, but
still you don't know
the meaning of the sentence.
So, we have to go further
through semantic analysis.
In our mind,
we usually can map such a sentence to what
we already know in our knowledge base.
And for example, you might imagine
a dog that looks like that,
there's a boy and
there's some activity here.
But for a computer,
we will have to use symbols to denote that.
All right.
So, we would use the symbol
d1 to denote a dog.
And, b1 to denote a boy, and then p1
to denote the playground.
Now, there is also a chasing
activity that's happening here, so
we have the relation chasing here,
that connects all these symbols.
So, this is how a computer would obtain
some understanding of this sentence.
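One simple way to picture this symbolic representation is as a small set of facts. The sketch below, in Python, stores each fact as a tuple; the predicate names (`Dog`, `Boy`, `Playground`, `Chasing`), the argument order, and the helper functions are assumptions made for this illustration.

```python
# Each fact is a tuple: (predicate, argument, ...).
FACTS = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),  # assumed order: chaser, chased, location
}

def entity_type(symbol, facts):
    """Look up the one-argument fact that tells us what a symbol denotes."""
    for fact in facts:
        if len(fact) == 2 and fact[1] == symbol:
            return fact[0]
    return None

def who_is_chased(facts):
    """Answer a simple question from the representation alone."""
    return [entity_type(f[2], facts) for f in facts if f[0] == "Chasing"]

print(who_is_chased(FACTS))
```

Once the sentence is in this form, a program can answer simple questions about it without looking at the original words again.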
Now from this representation, we could
also further infer some other things,
and we might indeed, naturally think
of something else when we read text.
And, this is called inference.
So for example, if you believe
that if someone's being chased, then
this person might be scared.
All right.
With this rule,
you can see computers could also
infer that this boy may be scared.
So, this is some extra knowledge
that you would infer based on
some understanding of the text.
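The rule about being chased can be sketched as one step of rule application over symbolic facts. In the Python sketch below, the tuple encoding, the predicate names, and the function `infer_scared` are assumptions for illustration; note that the rule is applied as if it were certain, whereas the actual conclusion is only that the boy may be scared.

```python
def infer_scared(facts):
    """Apply the rule Chasing(x, y, place) => Scared(y) once.

    Facts are tuples like ("Chasing", chaser, chased, place). The rule is
    treated as certain here, which ignores the uncertainty in real inference.
    """
    derived = {("Scared", fact[2]) for fact in facts if fact[0] == "Chasing"}
    return facts | derived

known = {("Dog", "d1"), ("Boy", "b1"), ("Chasing", "d1", "b1", "p1")}
extended = infer_scared(known)
print(("Scared", "b1") in extended)
```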
You can even go further to understand
why the person said this sentence.
So, this has to do with
the use of language.
All right.
This is called pragmatic analysis,
which tries to understand the speech
act of a sentence. All right,
we say something to
basically achieve some goal.
There's some purpose there and
this has to do with the use of language.
In this case, the person who said
the sentence might be reminding
another person to bring back the dog.
That could be one possible intent.
To reach this level of understanding,
we would require all these steps.
And, a computer would have to go
through all these steps in order to
completely understand this sentence.
Yet, we humans have no
trouble understanding that.
We instantly get everything,
and there is a reason for that.
That's because we have a large
knowledge base in our brain, and
we use common sense knowledge
to help interpret the sentence.
For computers, unfortunately,
it is hard to obtain such understanding.
They don't have such a knowledge base.
They are still incapable of doing
reasoning under uncertainty.
So, that makes natural language
processing difficult for computers.
But, the fundamental reason why
natural language processing is difficult
for computers is simply because natural
language has not been designed for
computers.
Natural languages
are designed for us to communicate.
There are other languages designed for
computers.
For example, programming languages.
Those are harder for us, right.
So, natural language is designed to
make our communication efficient.
As a result,
we omit a lot of common sense knowledge
because we assume everyone
knows about that.
We also keep a lot of ambiguities
because we assume the receiver, or
the hearer would know how to
disambiguate an ambiguous word,
based on the knowledge or the context.
There's no need to invent a different
word for different meanings.
We could overload the same word with
different meanings without a problem.
For these reasons,
every step in natural language
processing is difficult for computers.
Ambiguity is the main difficulty, and
common sense reasoning is often required,
that's also hard.
So, let me give you some
examples of challenges here.
Consider the word-level ambiguities.
The same word can have different
syntactical categories.
For example,
design can be a noun or a verb.
The word root may have multiple meanings.
So, square root in math sense,
or the root of a plant.
You might be able to
think of other meanings.
There are also syntactical ambiguities.
For example, the main topic of this
lecture, natural language processing,
can actually be interpreted in two ways,
in terms of the structure.
Think for a moment and
see if you can figure that out.
We usually think of this as
processing of natural languages, but
you could also think of this as saying,
language processing is natural.
Right.
So, this is an example of syntactic ambiguity,
where we have different
structures that can be
applied to the same sequence of words.
Another example of an ambiguous
sentence is the following,
a man saw a boy with a telescope.
Now, in this case, the question is,
who had the telescope?
All right, this is called a prepositional
phrase attachment ambiguity,
or PP attachment ambiguity.
Now, we generally don't have a problem
with these ambiguities because we have
a lot of background knowledge to
help us resolve the ambiguity.
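To see that the telescope sentence really has two structures, we can count its parses under a small grammar. The Python sketch below uses chart parsing in the CKY style; the toy grammar, the lexicon, and the function `count_parses` are assumptions made for this illustration, not a description of any particular parser.

```python
from collections import defaultdict

# Toy grammar in binary form: (parent, left child, right child).
# NP -> NP PP and VP -> VP PP are the two places a PP can attach.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon with one category per word.
LEXICON = {
    "a": "Det", "man": "N", "boy": "N", "telescope": "N",
    "saw": "V", "with": "P",
}

def count_parses(words):
    """Count distinct parse trees for the word list under the toy grammar."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, symbol) -> number of derivations
    for i, word in enumerate(words):
        chart[(i, i + 1, LEXICON[word])] += 1
    for length in range(2, n + 1):              # shorter spans are filled first
        for start in range(0, n - length + 1):
            end = start + length
            for split in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, split, left)] * chart[(split, end, right)]
                    )
    return chart[(0, n, "S")]

print(count_parses("a man saw a boy with a telescope".split()))
```

The two parses correspond to attaching "with a telescope" to the boy or to the seeing. The grammar alone cannot choose between them, which is where background knowledge comes in.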
Another example of difficulty
is anaphora resolution.
So, think about a sentence like John
persuaded Bill to buy a TV for himself.
The question here is,
does himself refer to John or Bill?
So again, this is something that
you have to use some background or
the context to figure out.
Finally, presupposition
is another problem.
Consider the sentence,
he has quit smoking.
Now this obviously
implies he smoked before.
So, imagine a computer wants to understand
all these subtle differences in meaning.
It would have to use a lot of
knowledge to figure that out.
It also would have to maintain a large
knowledge base of all meanings of
words and how they are connected to our
common sense knowledge of the world.
So this is why it's very difficult.
So as a result, we are still not perfect,
in fact, far from perfect, in understanding
natural languages using computers.
So this slide sort of gives a simplified
view of state of the art technologies.
We can do part-of-speech
tagging pretty well.
So, I showed 97% accuracy here.
Now this number is obviously
based on a certain data set, so
don't take this literally.
All right, this just shows that
we could do it pretty well.
But it's still not perfect.
In terms of parsing,
we can do partial parsing pretty well.
That means we can get noun phrase
structures or verb phrase structure, or
some segment of the sentence understood
correctly in terms of the structure.
And, in some evaluation
results we have seen about 90%
accuracy in terms of partial
parsing of sentences.
Again, I have to say, these numbers
are relative to the data set.
In some other data sets,
the numbers might be lower.
Most existing work has been
evaluated using news data sets.
And so, a lot of these numbers are more or
less biased towards news data.
Think about social media data.
The accuracy likely is lower.
In terms of semantic analysis,
we are far from being able to do
a complete understanding of a sentence.
But we have some techniques
that would allow us to do
partial understanding of the sentence.
So, I could mention some of them.
For example, we have techniques that can
allow us to extract the entities and
relations mentioned in text or articles.
For example, recognizing
the mentions of people, locations,
organizations, et cetera in text.
Right?
So this is called entity extraction.
We may be able to recognize the relations.
For example,
this person visited that place.
Or, this person met that person, or
this company acquired another company.
Such relations can be extracted
by using current
natural language processing techniques.
They are not perfect, but
they can do well for some entities.
Some entities are harder than others.
We can also do word sense
disambiguation to some extent.
We have to figure out whether this word in
this sentence would have certain meaning,
and in another context,
the computer could figure out
that it has a different meaning.
Again, it's not perfect but
you can do something in that direction.
We can also do sentiment analysis, meaning
to figure out whether a sentence
is positive or negative.
This is especially useful for
review analysis, for example.
So these are examples of semantic analysis.
And they help us to obtain partial
understanding of the sentences.
Right?
It's not
giving us a complete understanding as
I showed before for the sentence, but
it will still help us gain understanding
of the content and these can be useful.
In terms of inference,
we are not yet there,
probably because of the general difficulty
of inference under uncertainty.
This is a general challenge
in artificial intelligence.
That's probably also because we don't have
a complete semantic representation for
natural language text.
So this is hard.
Yet in some domains, perhaps in
limited domains when you have a lot of
restrictions on the word uses,
you may be able to perform
inference to some extent, but in general
we cannot really do that reliably.
Speech act analysis is also
far from being done, and
we can only do that analysis for
very special cases.
So, this roughly gives you some
idea about the state of the art.
And let me also talk a little
bit about what we can't do.
And so we can't even do
100% part-of-speech tagging.
This looks like a simple task,
but think about the example here,
the two uses of off may have different
syntactic categories if you try
to make fine-grained distinctions.
It's not that easy to figure
out such differences.
It's also hard to do general
complete parsing.
And, again, the same sentence
that you saw before is an example.
This ambiguity can be
very hard to disambiguate.
And you can imagine examples where you
have to use a lot of knowledge
in the context of the sentence or
from the background in order to figure
out who actually had the telescope.
So, although the sentence looks very
simple, it actually is pretty hard.
And in cases when the sentence is
very long, imagine it has four or
five prepositional phrases, then there
are even more possibilities to figure out.
It's also hard to do precise
deep semantic analysis.
So here's an example.
In this sentence, John owns a restaurant,
how do we define owns exactly?
The word, own, you know,
is something that we can understand but
it's very hard to precisely describe
the meaning of own for computers.
So as a result we have robust and
general natural language processing
techniques that can process a lot of text
data in a shallow way,
meaning we only do superficial analysis.
For example, part-of-speech tagging,
partial parsing, or recognizing sentiment.
And those are not deep understanding
because we're not really
understanding the exact
meaning of the sentence.
On the other hand, the deep understanding
techniques tend not to scale up well,
meaning that they would fail
on some unrestricted text.
And if you don't restrict
the text domain or
the use of words, then these
techniques tend not to work well.
They may work well, based on machine
learning techniques on the data
that are similar to the training data
that the program has been trained on.
But they generally wouldn't work well on
the data that are very different from
the training data.
So this pretty much summarizes the state
of the art of natural language processing.
Of course, within such a short amount
of time, we can't really give you
a complete view of NLP, which is a
big field, and I'd expect
to see multiple courses on the natural
language processing topic itself.
But, because of its relevance to the
topic that we talk about, it's useful for
you to know the background in case
you haven't been exposed to that.
So, what does that mean for
text retrieval?
Well, in text retrieval we
are dealing with all kinds of text.
It's very hard to restrict
the text to a certain domain.
And we also are often dealing with
a lot of text data, so that means
the NLP techniques must be general,
robust, and efficient, and that
just implies today we can only use fairly
shallow NLP techniques for text retrieval.
In fact,
most search engines today use something
called a bag of words representation.
Now this is probably the simplest
representation you can probably think of.
That is to turn text data
into simply a bag of words.
Meaning we will keep the individual words
but we'll ignore the order of the words.
And we'll keep duplicated
occurrences of words.
So this is called a bag
of words representation.
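Here is a minimal sketch of this representation in Python, using `collections.Counter` to keep the words and their counts while dropping the order. The tokenization rule (lowercase runs of letters) and the function name `bag_of_words` are assumptions for this illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map text to word counts: duplicates are kept, word order is dropped."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

bag = bag_of_words("A dog is chasing a boy on the playground.")
swapped = bag_of_words("A boy is chasing a dog on the playground.")

print(bag["a"])         # the duplicated word "a" is counted twice
print(bag == swapped)   # who chases whom is no longer distinguishable
```

The two sentences mean different things, yet they produce exactly the same bag, which is the information lost by ignoring word order.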
When you represent the text in this way,
you ignore a lot of the information,
and that just makes it harder
to understand the exact meaning of
a sentence because we've lost the order.
But yet, this representation tends
to actually work pretty well for
most search tasks.
And this is partly because the search
task is not all that difficult.
If you see matching of some of the query
words in a text document, chances
are that that document is about the topic,
although there are exceptions, right?
In comparison, some other tasks, for
example machine translation, would require
you to understand the language accurately,
otherwise the translation would be wrong.
So in comparison,
search tasks are relatively easy, and
such a representation is often sufficient.
And that's also the representation
that the major search engines today,
like Google or Bing are using.
Of course, I put in parentheses "but
not all."
Of course there are many queries that are
not answered well by the current search
engines, and
they do require a representation
that would go beyond bag
of words representation.
That would require more natural
language processing to be done.
There is another reason why we have
not used the sophisticated NLP
techniques in modern search engines, and
that's because some retrieval techniques
actually naturally solve
the problem of NLP.
So, one example,
is word sense disambiguation.
Think about a word like java.
It could mean coffee or
it could mean programming language.
If you look at the word
alone it would be ambiguous.
But when the user uses the word in
the query, usually there are other words.
For example I'm looking for
usage of Java applet.
When I have applet there that
implies Java means programming language.
And that context can help us naturally
prefer documents where Java is
referring to a programming language,
because those documents would
probably match applet as well.
If java occurs in the document
in a way that means coffee,
then you would never match applet,
or with very small probability.
Right.
So this is a case when some retrieval
techniques naturally achieve the goal
of word sense disambiguation.
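The Java example can be sketched with a very simple matching score. In the Python sketch below, the two documents, the scoring function (count how many distinct query words appear in the document), and all names are made up for illustration; real search engines use more refined scoring.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def score(query, document):
    """Number of distinct query words that occur in the document."""
    doc_bag = Counter(tokens(document))
    return sum(1 for term in set(tokens(query)) if doc_bag[term] > 0)

# Two hypothetical documents that use "Java" in different senses.
DOCS = {
    "programming": "This tutorial shows how to embed a Java applet in a web page.",
    "coffee": "Java is an island known for growing rich, dark coffee.",
}

query = "usage of Java applet"
ranked = sorted(DOCS, key=lambda name: score(query, DOCS[name]), reverse=True)
print(ranked)
```

Nothing in the code knows about word senses. The programming document simply matches both java and applet, so it is ranked first.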
Another example is a technique called
feedback, which we will talk about
later in some of the lectures.
This technique would allow us
to add additional words to the query.
And those additional words could
be related to the query words.
And these words can help match documents
where the original query words
have not occurred.
So this achieves, to some extent,
semantic matching of terms.
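The details of feedback come later, so the following is only a rough sketch of the general idea, under assumptions of my own: take some documents judged or assumed to be relevant, and add to the query the words that occur in the most of them. The stopword list, the selection rule, and the function `expand_query` are all hypothetical choices for this illustration.

```python
import re
from collections import Counter

# A tiny, assumed stopword list so that function words are not added.
STOPWORDS = {"a", "the", "of", "in", "is", "to", "and", "for"}

def tokens(text):
    """Lowercase, split into runs of letters, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def expand_query(query, feedback_docs, num_new_terms=2):
    """Add the words that appear in the most feedback documents to the query."""
    query_terms = tokens(query)
    doc_counts = Counter()
    for doc in feedback_docs:
        doc_counts.update(set(tokens(doc)))  # count each word once per document
    candidates = [(t, c) for t, c in doc_counts.items() if t not in query_terms]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by word
    return query_terms + [term for term, _ in candidates[:num_new_terms]]

feedback = [
    "Java applet security in the browser",
    "Running a Java applet in the browser sandbox",
]
expanded = expand_query("java applet", feedback, num_new_terms=1)
print(expanded)
```

With the added word, a document that talks about the browser but never mentions applet can still be matched.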
So those techniques also helped us
bypass some of the difficulties
in natural language processing.
However, in the long run, we still need
deeper natural language processing
techniques in order to improve the
accuracy of the current search engines.
And it's particularly needed for complex
search tasks, or for question answering.
Google has recently
launched a knowledge graph.
And this is one step toward that goal,
because knowledge graph would contain
entities and their relations.
And this goes beyond the simple
bag of words representation.
And such a technique should help us improve
the search engine utility significantly,
although this is still an open topic for
research and exploration.
In sum, in this lecture we talked
about what NLP is, and we've talked
about the state of the art techniques,
what we can do, what we cannot do.
And finally, we also explained
why bag of words representation
remains the dominant representation used
in modern search engines even though
deeper NLP would be needed for
future search engines.
If you want to know more you can take
a look at some additional readings.
I only cited one here.
And that's a good starting point though.
Thanks.
[MUSIC]

